Motivation

In the last two lectures, we were analyzing health care data in the United States. We were exploring the following questions:

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?
  2. Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?
  3. How do healthcare coverage and spending relate to life expectancy?

The data is from the Henry J Kaiser Family Foundation (KFF).

Let’s recall what we did before:

Load in the tidyverse of R package

Read in the datasets with the readr R package

Convert the coverage and spending datasets from wide to long formats using the tidyr R package

Wrangle the datasets using some dplyr action verbs

Add abbreviation and region information to the coverage dataset

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Join the datasets together

Add created two new columns prop_coverage and spending_capita

## # A tibble: 612 x 11
##    Location          year type    tot_coverage abb   region   tot_spending
##    <chr>            <int> <chr>          <int> <chr> <fct>           <int>
##  1 Alabama           2013 Employ…      2126500 AL    South           33788
##  2 Alaska            2013 Employ…       364900 AK    West             7684
##  3 Arizona           2013 Employ…      2883800 AZ    West            41481
##  4 Arkansas          2013 Employ…      1128800 AR    South           20500
##  5 California        2013 Employ…     17747300 CA    West           278168
##  6 Colorado          2013 Employ…      2852500 CO    West            34090
##  7 Connecticut       2013 Employ…      2030500 CT    Northea…        34223
##  8 Delaware          2013 Employ…       473700 DE    South            9038
##  9 District of Col…  2013 Employ…       324300 DC    South            7443
## 10 Florida           2013 Employ…      8023400 FL    South          150547
## # ... with 602 more rows, and 4 more variables: life_exp_years <dbl>,
## #   tot_pop <int>, prop_coverage <dbl>, spending_capita <dbl>

Let’s learn a little bit more about ggplot2 to answer those questions using the tools we have learned in the last two lectures

Data Visualization

The ggplot2 R package

In the previous lecture, we learned about

  1. the main ggplot() function
  2. aesthic mappings or or aes (connects the data we see on the graph; used as input to the geometry functions)
  3. geometry functions or geoms (adding layers on top of the ggplot() object)
  4. scales or how the data will be plotted (e.g. continuous, discrete, log, etc)

Creating scatter plots using geom_point()

We also created our first ggplot() object to explore the question

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

It would be nice to know which state is represented by which state. For this, we will introduce another geom called geom_text().

Adding layers of text using geom_text()

In our dataset, we have information about the abbreviation for each state. We could add the abbreviations for each state next to the point on the plot to assess which states have a higher or lower coverage for a given amount of money they spend per capita.

That is cool, but it would be even better if we could nudge the text over a bit. Let’s look at the help file for geom_text():

We see there is an argument called nudge_x and nudge_y. We can use these to nudge the text over a bit so the text is not directly on top of the points.

Your turn

Color each point (or state) by what region they are from.

faceting across a variable

Ok, getting back to our original question:

  1. Is there a relationship between healthcare coverage and healthcare spending in the United States?

We saw there was a positive relationship, but this was only for one type of healthcare coverage (Employer) and one year. What about the other types?

For this, we will introduce facets. The idea of faceting is to stratify the data by some variable and make the same plot for each strata.

For example, if we wanted to facet by the type variable, we will add a layer to our ggplot() object using the facet_grid() or facet_wrap() functions. The function expects the row and column variables to be separated by a ~.

## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 9 rows containing missing values (geom_text).

We see that the proportion of people covered have different scales in the y-axis. Let’s read the help file to see if there is some way to not restrict the y-axis to be the same.

Yes, we see there is an argument called scales that can be free_y, (free columns), free_x (free rows), and free (both). Let’s try free_y and look at a different year (year=="2014"):

## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_text).

Given we know Other Public refers to the military or Veterans Adminstration, we can see states like HI, VA, NV have a larger proportion of military or VA Other Public type coverage. While a state like AK has a similar proportion of Other Public coverage, it has a much larger spending per capita.

We also see a negative relationship with the Uninsured type. The more states spend, the less uninsured people in the state.

Creating boxplots using geom_boxplot()

Next, let’s revisit the second question.

  1. Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?

Let’s try making a boxplot with ggplot2. If you recall, the way to do this in base R was:

Now, we introduce the geom_boxplot() function. Note, we needed to tell ggplot2 what needs to be along the x and y axis in aes().

faceting by two variables

  1. How do healthcare coverage and spending relate to life expectancy?

In our last section, let’s first explore the relationship between spending and life expectancy using facets. We were faceting by healthcare type, so let’s stick with that, but change the y-axis to life_exp_years.

Hmm… So looks like there is a positive association for all the types (more you spend the longer the life expectacy). But there is something strange about the fact that the regions have very different life expectancies.

Let’s try faceting by both region and type. Note that we can facet by rows putting a column name before the ~ and facet by columns putting a column name after the ~. We are also using facet_grid() instead of facet_wrap().

What trends do you see?

Finally, let’s do the same for the prop_coverage column instead of spending_capita column to explore the relationship between life expectancy

## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_text).

What trends do you see? How would you summarize these results? What’s most interesting?

ggplot2 cheatsheet

Mini-Case Study 1

For the last 20 mins of class, we will break into groups and work on a mini-case study. You can find the case study on the course website.